Analysis of Zillow Home Value Index across Time and Location

Gaurav Rimal, Mike Lupo, Sean Moadel

Introduction:

Moving is something that almost everyone experiences at least once in their lifetime. Whether it is for a job, family, or any other reason, it is important to understand the current state of the housing market. As college students who will soon be looking for jobs, we need to understand the housing market as we think about relocation. This topic matters not only to us, but to everyone who is considering moving, especially given the current tightness of the housing market.

According to the Federal Reserve, the housing market has tightened considerably. The supply of homes available for sale has fallen to historically low levels and home price growth has increased greatly during the pandemic. In this tutorial, we aim to analyze how home values have changed in the 21st century across the United States. We will be measuring changes in the Zillow Home Value Index (ZHVI), which is a seasonally adjusted measure of typical home value and market changes.

ZHVI is provided by Zillow Inc. and captures two key variables in the housing market, both currently and over time: home value and housing market appreciation. For more details about ZHVI and how it is calculated, visit this link. Another link that may be helpful is the ZHVI User Guide.

Using the ZHVI dataset, our goal is to predict future ZHVI and see whether the size of a region affects its ZHVI. In order to do this, we will be going through the data science lifecycle, which has five parts:

  1. Data Collection
  2. Data Processing
  3. Exploratory Analysis and Data Visualization
  4. Analysis, Hypothesis Testing, and Machine Learning
  5. Insight and Policy Decision

Data Collection

This data was obtained from https://www.zillow.com/research/data/ under Home Values.

First, let's import some libraries that we will be using:
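Here is a minimal sketch of the import and loading step using pandas. The file layout is an assumption: a wide table with one row per region and one column per month. The column names (`RegionID`, `SizeRank`, `RegionName`, `State`) and the two-row sample are illustrative stand-ins for the CSV downloaded from the Zillow research page, not values taken from the real dataset.

```python
import io

import pandas as pd

# Hypothetical sample in the assumed wide layout: one row per region,
# one column per month. Replace this with the path to the downloaded CSV.
SAMPLE_CSV = """RegionID,SizeRank,RegionName,State,2000-01-31,2000-02-29
1,0,New York,NY,150000,151000
2,1,Los Angeles,CA,200000,
"""


def load_zhvi(source):
    """Read a wide ZHVI table from a file path or file-like object."""
    return pd.read_csv(source)


raw = load_zhvi(io.StringIO(SAMPLE_CSV))
print(raw.head())
```

The empty cell in the second row is read as NaN, which is how missing months show up later in the cleaning steps.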

If you do not have any of these libraries installed, you can install them by entering pip3 install [package]. If you need any more information, you can see the documentation or a tutorial for each library.

Data Processing

Now, let's clean the data so we can extract only the relevant details such as Dates, Locations and the ZHVI. By getting rid of unnecessary information, we get a much cleaner dataset that is easier to read.
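One way this cleaning step could look is to reshape the wide table into a long one with a row per region and date. This is a sketch under assumptions: the column names and the sample values are hypothetical, and date columns are detected simply by checking that the name starts with a four-digit year.

```python
import pandas as pd

# Hypothetical wide sample; the real file has many more regions and months.
wide = pd.DataFrame({
    "RegionID": [1, 2],
    "SizeRank": [0, 1],
    "RegionName": ["New York", "Los Angeles"],
    "State": ["NY", "CA"],
    "2000-01-31": [150000.0, 200000.0],
    "2000-02-29": [151000.0, None],
})

# Keep only the identifying columns we care about and the monthly values.
id_cols = ["RegionName", "State", "SizeRank"]
date_cols = [c for c in wide.columns if c[:4].isdigit()]

# Melt so that each row is one (region, date) observation.
tidy = wide.melt(id_vars=id_cols, value_vars=date_cols,
                 var_name="Date", value_name="ZHVI")
tidy["Date"] = pd.to_datetime(tidy["Date"])
tidy = tidy.sort_values(["RegionName", "Date"]).reset_index(drop=True)
print(tidy)
```

The long format makes grouping by state, region, or date a one-line operation in the later steps.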

The data now looks clean but there could be some missing values. Let's check! Using the Pandas library, we can check if there are any missing values in the dataframe using the code below.
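A short sketch of that check, using a small hypothetical dataframe in place of the cleaned Zillow data:

```python
import numpy as np
import pandas as pd

# Hypothetical long-format sample with two missing ZHVI values.
tidy = pd.DataFrame({
    "RegionName": ["A", "A", "B", "B"],
    "State": ["NY", "NY", "CA", "CA"],
    "ZHVI": [100.0, np.nan, np.nan, 320.0],
})

# isna() marks each missing cell; sum() counts them per column.
missing_per_column = tidy.isna().sum()
total_missing = int(missing_per_column.sum())

print(missing_per_column)
print("Total missing values:", total_missing)
```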

In order to impute the missing values, we are going to calculate the average ZHVI for the whole state on that date.
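The state-level imputation can be sketched with a grouped transform. The data below is hypothetical and the column names are assumptions. Note that a value cannot be filled this way if every region in that state is missing on that date.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: regions A and B are in NY, region C is in CA.
tidy = pd.DataFrame({
    "RegionName": ["A", "B", "C", "A", "B", "C"],
    "State": ["NY", "NY", "CA", "NY", "NY", "CA"],
    "Date": pd.to_datetime(["2000-01-31"] * 3 + ["2000-02-29"] * 3),
    "ZHVI": [100.0, np.nan, 300.0, 110.0, 130.0, np.nan],
})

# Mean ZHVI of each (state, date) group, aligned row by row with tidy.
state_date_mean = tidy.groupby(["State", "Date"])["ZHVI"].transform("mean")

# Fill only the missing cells; observed values are left untouched.
tidy["ZHVI"] = tidy["ZHVI"].fillna(state_date_mean)
print(tidy)
```

In this sample, region B in January takes the NY average for that month, while region C in February stays missing because it is the only CA region and has no value on that date.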

There are still some missing values. Let's fill those with average ZHVI for the whole region across our window of time. After we fill in these values, we now have 0 missing values and our dataset is almost ready to be explored.
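A sketch of this second pass, assuming that "region" means the individual `RegionName` (a hypothetical column name) and that the average is taken over all dates available for that region:

```python
import numpy as np
import pandas as pd

# Hypothetical sample: region C has one missing month.
tidy = pd.DataFrame({
    "RegionName": ["C", "C", "C", "D", "D", "D"],
    "Date": pd.to_datetime(["2000-01-31", "2000-02-29", "2000-03-31"] * 2),
    "ZHVI": [300.0, np.nan, 320.0, 50.0, 60.0, 70.0],
})

# Mean ZHVI of each region over the whole time window.
region_mean = tidy.groupby("RegionName")["ZHVI"].transform("mean")
tidy["ZHVI"] = tidy["ZHVI"].fillna(region_mean)

remaining = int(tidy["ZHVI"].isna().sum())
print("Missing values left:", remaining)
```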

Let's just calculate the average ZHVI for states as well.
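The state averages could be computed with a single group-by, as in this sketch on hypothetical data (the `State`, `Date`, and `ZHVI` column names are assumptions):

```python
import pandas as pd

# Hypothetical sample: two NY regions and one CA region over two months.
tidy = pd.DataFrame({
    "State": ["NY", "NY", "CA", "NY", "NY", "CA"],
    "Date": pd.to_datetime(["2000-01-31"] * 3 + ["2000-02-29"] * 3),
    "ZHVI": [100.0, 200.0, 300.0, 110.0, 210.0, 320.0],
})

# One row per (state, date) holding the mean ZHVI of that state's regions.
state_avg = tidy.groupby(["State", "Date"], as_index=False)["ZHVI"].mean()
print(state_avg)
```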

Finally, the data is ready to be dissected. Let's move on to analysis.

Exploratory Analysis & Data Visualization

To start off we are going to visualize how the ZHVI changed per state over time. To do this, we are going to use the Matplotlib library to create a scatterplot with each datapoint on it.
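A minimal sketch of such a scatterplot, drawing one series per state. The dataframe is a hypothetical stand-in for the state averages, and the figure size and marker size are arbitrary choices.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical state-level averages for two states over three months.
state_avg = pd.DataFrame({
    "State": ["NY"] * 3 + ["HI"] * 3,
    "Date": pd.to_datetime(["2000-01-31", "2000-02-29", "2000-03-31"] * 2),
    "ZHVI": [150000.0, 151000.0, 152500.0, 240000.0, 241000.0, 243000.0],
})

fig, ax = plt.subplots(figsize=(10, 6))

# One scatter call per state so each state gets its own color and legend entry.
for state, group in state_avg.groupby("State"):
    ax.scatter(group["Date"], group["ZHVI"], s=10, label=state)

ax.set_xlabel("Date")
ax.set_ylabel("ZHVI")
ax.set_title("Average ZHVI per state over time")
ax.legend(title="State")
```

With all states plotted, the legend gets long, so placing it outside the axes or omitting it may read better.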

We can see that states like New York and Hawaii have had high ZHVI for almost the entire period. ZHVI seems to peak around times of crisis like the 2008 recession and the COVID-19 pandemic. For more information on the current state of the housing market as a result of COVID-19, visit this link.

The next thing we are going to visualize is the end of year ZHVI for every complete year. This will allow us to see trends in the data that were more difficult to see in our previous graph.
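One possible way to pick out these values is sketched below. It assumes that "end of year" means the December observation and that a "complete year" is one with all twelve months present. Both are our reading of the description, and the data is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical series: 18 consecutive months starting January 2000,
# so 2000 is complete and 2001 is not. Month-start dates are used for brevity.
dates = pd.date_range("2000-01-01", periods=18, freq="MS")
tidy = pd.DataFrame({"Date": dates, "ZHVI": 100.0 + np.arange(18)})

tidy["Year"] = tidy["Date"].dt.year
tidy["Month"] = tidy["Date"].dt.month

# Number of distinct months observed in each year, aligned with every row.
months_seen = tidy.groupby("Year")["Month"].transform("nunique")

# Keep December rows that belong to a year with all 12 months.
year_end = tidy[(months_seen == 12) & (tidy["Month"] == 12)]
print(year_end[["Year", "ZHVI"]])
```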

The average end-of-year ZHVI shows a net increase from 2000 to 2020. The spread of ZHVI is also much wider than it was in the earlier part of this period.

Let's look at average ZHVI per region size:
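A sketch of how the per-region averages might be computed, assuming region size is represented by a `SizeRank` column (a hypothetical name, where a lower rank means a larger region). The correlation at the end is an extra numeric check, not part of the original plot.

```python
import pandas as pd

# Hypothetical sample: three regions with two monthly observations each.
tidy = pd.DataFrame({
    "RegionName": ["A", "A", "B", "B", "C", "C"],
    "SizeRank": [0, 0, 1, 1, 2, 2],
    "ZHVI": [100.0, 120.0, 300.0, 320.0, 90.0, 110.0],
})

# Average ZHVI of each region over the whole time window.
region_avg = tidy.groupby(["RegionName", "SizeRank"], as_index=False)["ZHVI"].mean()

# Pearson correlation between size rank and average ZHVI.
corr = region_avg["SizeRank"].corr(region_avg["ZHVI"])
print(region_avg)
print("Correlation:", corr)
```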

From this visualization we can see that the region size does not seem to have a connection with ZHVI. There are small, large, and medium sized regions with high ZHVIs which suggests that the relationship between region size and ZHVI is very weak if existent at all.

The last thing we are going to visualize is the average ZHVI across states. From this graph we will be able to see which of the states have the highest ZHVI values.
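The ranking behind such a graph can be sketched as a group-by followed by a sort. The values below are made up for illustration and are not real ZHVI figures.

```python
import pandas as pd

# Hypothetical long-format sample with illustrative values only.
tidy = pd.DataFrame({
    "State": ["HI", "HI", "CA", "CA", "OH", "OH"],
    "ZHVI": [500.0, 520.0, 400.0, 420.0, 100.0, 120.0],
})

# Mean ZHVI per state over all dates, highest first.
state_rank = tidy.groupby("State")["ZHVI"].mean().sort_values(ascending=False)
print(state_rank)
```

Calling `state_rank.plot(kind="bar")` would then draw the states from highest to lowest average.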

Hawaii, DC, California and Massachusetts have some of the most valuable homes. It is no surprise that Hawaii has the highest ZHVI due to it being a tropical area. The same applies for California which is also known for sunshine and palm trees. This suggests that things like climate and weather may have a connection to ZHVI, which is something that we could test in the future.

Analysis, Hypothesis Testing, & ML

The fourth step in the data science lifecycle is analysis, hypothesis testing, and machine learning. In this part, we are going to use the data to determine whether we should reject or fail to reject the null hypothesis that region size has no relationship with ZHVI.

To start off, we are going to create a polynomial model for our data that shows ZHVI versus region size rank.
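Below is a minimal sketch of a polynomial fit and its r-squared value. It uses `numpy.polyfit`, which performs a least-squares fit, as a stand-in for whatever modelling library the original cell used. The degree of 2, the helper name `polynomial_r_squared`, and the synthetic data are all assumptions for illustration.

```python
import numpy as np


def polynomial_r_squared(x, y, degree=2):
    """Fit a least-squares polynomial and return (coefficients, r-squared)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    coeffs = np.polyfit(x, y, deg=degree)
    fitted = np.polyval(coeffs, x)
    ss_res = np.sum((y - fitted) ** 2)    # variation left unexplained
    ss_tot = np.sum((y - y.mean()) ** 2)  # total variation around the mean
    return coeffs, 1.0 - ss_res / ss_tot


# Synthetic data: average ZHVI generated independently of size rank.
rng = np.random.default_rng(0)
size_rank = np.arange(50)
avg_zhvi = 200_000 + rng.normal(0, 50_000, size=50)

coeffs, r_squared = polynomial_r_squared(size_rank, avg_zhvi, degree=2)
print(f"R^2 = {r_squared:.3f}")
```

An r-squared near 0 means the fitted curve explains little of the variation in average ZHVI, while a value near 1 means it explains almost all of it.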

Interpretation: Insight & Policy Decision

The last part of the data science lifecycle is interpretation. This step involves understanding the data and operations you have done, and coming to conclusions.

The graph above shows that the model clearly does not fit the data, and the data shows that region size and Zillow Home Value Index do not seem to be correlated. We can see in our OLS model that our r-squared value is 0.084. This low r-squared value suggests that very little of the average ZHVI movement per region is accounted for by region size.

If we were to do this again, we should find more data and run a similar test to see if we can get a stronger r-squared value. Some examples for variables to use in future tests would be climate, weather, square footage, and many more.

Due to the poor results of our model, it would be very difficult to predict ZHVI values and the results would probably not be very accurate.

However, it is still apparent that the housing market hits peak prices around times of crisis, such as the COVID-19 pandemic or the 2008 recession. This is likely a result of people feeling safer in their current homes and not wanting to sell. The law of supply and demand dictates that with fewer people wanting to sell their homes, prices will rise, which is why we saw the trends that we did in our earlier visualizations.

Conclusion

While the results of our study were inconclusive, we hope you learned about the data science lifecycle, which can be applied to almost any dataset.

Throughout this tutorial we have covered how to use a number of data science assets such as pandas, numpy, and matplotlib. We have shown you how to read and clean data, and then visualize it to see trends. We have also shown you how to create a model, run hypothesis tests using that model, and understand your results.